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ABSTRACT 

Summary: Bacterial genomes are simpler than mammalian ones, and 
yet assembling the former from the data currently generated by high- 
throughput short-read sequencing machines still results in hundreds of 
contigs. To improve assembly quality, recent studies have utilized 
longer Pacific Biosciences (PacBio) reads or jumping libraries to con- 
nect contigs into larger scaffolds or help assemblers resolve ambigu- 
ities in repetitive regions of the genome. However, their popularity in 
contemporary genomic research is still limited by high cost and error 
rates. In this work, we explore the possibility of improving assemblies 
by using complete genomes from closely related species/strains. We 
present Ragout, a genome rearrangement approach, to address this 
problem. In contrast with most reference-guided algorithms, where 
only one reference genome is used. Ragout uses multiple references 
along with the evolutionary relationship among these references in 
order to determine the correct order of the contigs. Additionally, 
Ragout uses the assembly graph and multi-scale synteny blocks to 
reduce assembly gaps caused by small contigs from the input assem- 
bly. In simulations as well as real datasets, we believe that for common 
bacterial species, where many complete genome sequences from 
related strains have been available, the current high-throughput 
short-read sequencing paradigm is sufficient to obtain a single high- 
quality scaffold for each chromosome. 

Availability: The Ragout software is freely available at: https://github. 

com/fenderglass/Ragout. 

Contact: spham@salk.edu 



1 INTRODUCTION 

The recent proliferation of next-generation sequencing with short 
reads has enabled many new experimental opportunities, but it 
has also raised formidable computational challenges in genome 
assembly. Even for relatively simple bacterial genomes, their 
assemblies from current generation of high-throughput short 
reads are still fragmented with hundreds of contigs. To improve 
the assembly's quality, recent studies have utilized longer Pacific 
Biosciences (PacBio) reads or jumping libraries to connect con- 
tigs into larger scaffolds or help assemblers resolve ambiguities in 
repetitive regions of the genome (Bashir, 2012; Deshpande, 2013; 
Koren, 2012). However, their popularity in current genomic re- 
search is still limited by high cost and error rates. 

When a related genome is available, an alternative approach 
is to use this genome to guide the assembly of the target genome, 
in a method called 'reference-assisted assembly'. The first 
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reference-assisted assembly tools ahgned contigs against the ref- 
erence and ordered them according to their positions in the ref- 
erence genome. While this approach is still commonly used, it 
introduces errors when structural variations between the refer- 
ence and the assembled (target) genome are present. In an at- 
tempt to address this problem, Gaul and Blanchette (Gaul and 
Blanchette, 2006) formulated the contig ordering problem, which 
attempts to order contigs so that the 2-break distance (DCJ dis- 
tance) (Alekseyev and Pevzner, 2009; Bergeron et aL, 2006) be- 
tween the resulting scaffold and the reference genome is 
minimized. This formulation has been further applied in some 
reference-guided assembly tools (Richter et aL, 2007; Rissman 
et aL, 2009). Unfortunately, while the approach is theoretically 
sound, these tools still generate erroneous scaffolds when there 
are rearrangements between the target and reference genomes. 
This poses an important question: is a single reference genome 
sufficient to obtain a single scaffold (for each chromosome) with- 
out errors? 

Recently, Kim et aL (2013) introduced RACA software, which 
made an important step toward reliable reconstruction of the 
target (assembled) genome. In contrast to other tools, which 
use only one reference, RACA utilizes a reference as well as 
multiple outgroups to guide the assembly. This approach 
proved to be valuable, since the adjacency information in the 
outgroups can also help infer the adjacencies in the target 
assembly. 

Although RACA marked an important advancement in the 
reference-guided assembly problem, it still suffers some limita- 
tions. First, RACA uses information from outgroup genomes, 
but it heavily relies on a single reference. As with any genome 
rearrangement tools, RACA decomposes these sequences into 
synteny blocks. However, rather than constructing synteny 
blocks by considering all input sequences, RACA constructs syn- 
teny blocks based on pairwise sequence ahgnment against only 
the reference genome. This approach, in some cases, cannot 
detect synteny blocks (Pham and Pevzner, 2010) and also 
raises the question of what to do with assembly sequences (con- 
tigs) that do not align against the reference. Second, unlike syn- 
teny blocks constructed from complete genomes, synteny blocks 
constructed in the presence of contigs can be fragmented, since 
assembhes usually have contigs of various lengths. Constructing 
synteny blocks from fragmented assemblies raises a problem: on 
which scale should synteny blocks be constructed? If one con- 
structs large-scale synteny blocks, then small and fragmented 
synteny blocks (within small contigs) are not considered, thus 
leading to gaps in the assembly. On the other hand, if one con- 
structs small-scale synteny blocks, then the rearrangement 



© The Author 2014. Published by Oxford University Press. 

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/ 
by-nc/3.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial 
re-use, please contact journals.permissions@oup.com 



Reference-assisted assembly tool for bacterial genomes 



analysis becomes harder, since smaller synteny blocks are more 
likely to exhibit structural variations and are also more suscep- 
tible to be incorrectly identified (i.e. false synteny blocks). This 
dilemma must be addressed in order to obtain high-quality 
scaffolds. 

In this work, we present Ragout (Reference-Assisted Genome 
Ordering UTility), a genome rearrangement approach for refer- 
ence-assisted assembly that can produce high-quality scaffolds 
with a small number of misordered contigs and high genome 
coverage. Rather than working with a single reference. Ragout 
uses multiple complete genomes from closely related species/ 
strains. In contrast with most existing tools, in which only a 
fixed scale of synteny blocks is used, our algorithm works itera- 
tively with different scales of synteny blocks and also utilizes the 
assembly graph to improve scaffolds. 

We demonstrate that with multiple references. Ragout signifi- 
cantly improves the assembly of the target genome compared to 
existing tools. Through simulations as well as real datasets, we 
believe that for common bacterial species, for which many com- 
plete genome sequences from related strains have been made 
available, the current high-throughput short-read-sequencing 
paradigm is sufficient to assemble into a single high-quality scaf- 
fold. The Ragout software is freely available at: https://github. 
com/fenderglass/Ragout. 

2 METHODS 

2.1 Algorithm overview 

Ragout takes as input: 

• an assembly (a set of contigs); 

• a set of related genomes; and 

• a phylogenetic tree of all the genomes (including the target 
assembly). 

It outputs high-quality scaffolds in the form of ordered lists of contigs. 

Ragout first decomposes the input sequences into synteny blocks using 
Sibelia software (Minkin et al., 2013). After this stage, these sequences are 
represented in the alphabet of synteny blocks instead of as strings of 
nucleotides. While each reference sequence is transformed into a single 
sequence of synteny blocks (a list of signed numbers), the assembly cor- 
responds to multiple sequences of synteny blocks because the target 
genome has been fragmented into multiple contigs. 

Due to this fragmentation, some adjacency information is missing. 
Ragout uses a genome-rearrangement approach to infer these missing 
adjacencies. Initially, we filter all repetitive blocks as well as blocks 
that are not present in the target assembly, since these kinds of synteny 
blocks are hurdles in most current genome-rearrangement algorithms. 
From the remaining blocks, we build an incomplete multi-color break- 
point graph, in which vertices correspond to the ends of synteny blocks 
and edges represent adjacencies between them. Ragout further solves 
the Half -breakpoint State Parsimony Problem to transform this graph 
to the 'normal' multi-color breakpoint graph (Alekseyev and Pevzner, 
2009) by recovering missing adjacencies in the target assembly. Next, 
contigs are assembled into scaffolds with respect to the inferred adja- 
cencies. The above procedure is repeated multiple times with different 
synteny block scales, and the resulting scaffolds in these iterations are 
reconciled into a single set of scaffolds. Afterwards, a refinement step 
is performed. In this step, small and repetitive contigs are recovered 
and inserted back into the scaffolds by using the adjacency information 



from the assembly graph. The pseudocode of Ragout is described in 
Algorithm 1. 

Algorithm 1 Ragout pseudocode 

procedure ^AGO\Ji:{references , target, phytogeny, blockSizes) 
assemblies ^ 0 

for all blockSize in blockSizes do 

synBlocks ^ RuNSiBELiA(re/ere/ice5, target, blockSize) 

bpGraph ^ BuiLDBREAKP0INTGRAPH(5y/l5/(9c/c^) 

weightedGraph ^ EDGEsScoRE(bpGraph, phytogeny) 
adjacencies ^ MiNPERFMATcniNGiweightedGraph) 
scaffolds ^ BuiLDScAFFOLDs(/<2r^e/, adjacencies) 
ADD(scaffolds, assemblies) 
end for 

scaffolds ^ MERGElTERATiONsiassemlies) 
assemblyGraph ^ BuiLDAssEMBLYGRAPH(/<3r^eO 
scaffolds ^ REFiNEScAFFOLDs(5c<3//(9/(is, assembly Graph) 
OvTPmScAFFOLDs(scaffolds) 
end procedure 



Since the synteny block-reconstruction algorithm used in Ragout has 
been described in (Minkin et al., 2013), we will not describe it in this 
article. Below, we delve into the details of Ragout algorithm, assuming 
that synteny blocks have already been constructed. 

2.2 A genome-rearrangement approach for recovering 
missing adjacencies 

Given reference sequences and assembly in the alphabet of synteny 
blocks, our task is to recover the missing adjacencies of synteny blocks 
in the target assembly. While breakpoint graphs best capture adjacency 
information, they have thus far only been defined for complete genomes. 
Below, we introduce incomplete multi-color breakpoint graphs for present- 
ing synteny block adjacencies in the assembly and reference sequences. 

2.2.1 Incomplete multi-color breakpoint graphs Given an assembly 
A and k reference sequences Pi, . . . , Py^^ in the alphabet of synteny blocks 
B, we define the incomplete multi-color breakpoint graph BG(A, Pi, ... , 
Pk) (V, E), where F= {b'l, b\\bi e B}. For each synteny block, there are 
two vertices in the graph which correspond to the tail and head of the 
block. Edges are undirected and colored by + 1 colors. An edge con- 
nects vertices that correspond to heads/tails of adjacent synteny blocks 
and is colored by the corresponding color of the genome/assembly. To 
simplify the notation, we use red, Pi,. . ., P^ to refer to the colors of 
edges, where red edges represent the adjacencies of synteny blocks in 
the target assembly A, and Pi represents the adjacencies of synteny 
blocks in genome Pi (see Fig. lA). 

Before constructing the graph, we filter synteny blocks from reference 
genomes that are not present in the target assembly, since these synteny 
blocks do not help to infer the adjacencies in the target genome. Also, we 
filter synteny blocks that have multiple copies within any sequences, since 
duplicated synteny blocks make further analysis in the breakpoint graph 
complicated. 

After this filtering, the constructed graph has two important proper- 
ties: (i) each vertex has at most one edge of each color (since all repetitive 
synteny blocks are removed), and (ii) each vertex corresponds to a certain 
synteny block in the target genome. Therefore, if the target genome were 
available, the set of all red edges would define a perfect matching in the 
graph. However, since the genome is fragmented into contigs, the adja- 
cency information of the target genome at the vertices that correspond to 
the end of contigs is missing. The main task is to infer these missing red 
edges in the graph using other adjacencies from the reference genomes 
as well as using their evolutionary relationship in the form of a phylo- 
genetic tree. 
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Fig. 1. (a) A breakpoint graph of three reference genomes and one as- 
sembly. The three reference genomes (Refl, Ref2 and Ref3) are presented 
as cyclic permutations of synteny blocks: Refl(blue): +1+2+3+4 
+ 5, Ref2(greeny. + 1+3+4 + 5 Siud RefSiorange): +1-4-3 + 5, 
respectively. The target assembly (red) is presented as four separated 
permutations (corresponds to four contigs): Target Assembly: +1 | + 2 + 
3 I + 4 I + 5. (b) A phylogenetic tree representing the states of the half- 
breakpoint 5'\ Each leaf is labeled by the state of the half-breakpoint 5^^ in 
the corresponding reference/target genome. (According to this tree, the 
state of 5^' in the target genome is A\ although the correct state of 5^' in the 
target genome is unknown.) 

2.2.2 The phylogenetic tree Let r be a rooted phylogenetic tree of 
the genomes A, Pi, ... , Pj^. T consists of /r + 1 leaves, {k-\) internal 
nodes of degree three, and one node of degree two (called the root). 
Edges connecting two nodes are called branches. Branch length represents 
evolution time (evolution time is usually inferred from the number of 
nucleotide substitutions in sequence alignment). 

A parsimonious model for rearrangements. Given a tree T with all 
of its /c + 1 leaves represented as complete genomes (in the alphabet of 
synteny blocks), the parsimony procedure (Fitch, 1971; Sankoff, 1975) 
explains the descent of these various sequences from a common ancestor 
with the fewest number of changing operations, associated with a min- 
imum cost value. When one of these genomes is divided into contigs, a 
possible formulation for recovering the target genome corresponds to 
finding a permutation of these contigs such that the parsimony procedure 
returns the lowest cost among all possible orderings of these contigs. Note 
that this is a double optimization. One step finds the most parsimonious 
explanation for a given configuration of the target genome, while the 
other step finds the configuration of the target genome having lowest 
cost. The usual (and more correct) choice for the allowed operation in 
the first optimization is the 2-break operation (also called a DCJ oper- 
ation) since it can well approximate many real rearrangement operations 
(Alekseyev and Pevzner, 2009). However, accounting for such an oper- 
ation will lead to an NP-complete problem (Ma et al, 2006). In this work, 
rather than using the 2-break operation, we study the parsimonious 
approach on individual breakpoints. While considering breakpoints indi- 
vidually may not adequately reflect the reality of rearrangements (since 
each rearrangement uses at least two breakpoints), this method is com- 
putationally feasible and has been proven to be valuable in the problem 
of ancestral genome reconstruction (Ma et al, 2006). 

In the breakpoint graph BG(A, Pi, . . . , P^), we call each vertex a half- 
breakpoint (since a breakpoint involves two synteny block ends). For a 
given genome Pj, the state of a half-breakpoint is defined as the half- 
breakpoint adjacent to it, i.e. the vertex vj, such that the color of the edge 
(Vj, Vj) is Pi (color) in BG. Because of the first graph property described 
above, each half-breakpoint has at most one such state. If the synteny 
block corresponding to v,- is missing in a genome, the state is void (see 
Fig. IB). 

Rather than using the parsimony procedure with 2-break operations, 
we study the parsimony procedure with respect to the change of half- 
breakpoint's states for each vertex in the breakpoint graph. The cost 
associated with a state change along a branch b of length t in the phylo- 
genetic tree is W(b) = e~^. Intuitively, the longer a branch, the more likely 



a half-breakpoint state can change along it, and therefore the smaller the 
corresponding cost. Next, we formulate the Half-breakpoint State 
Parsimony Problem in order to infer the evolutionary scenario of a par- 
ticular half-breakpoint. 

2.2.3 Half-breakpoint state parsimony Given a half-breakpoint u 
and the phylogenetic tree T, each leaf of which is labeled by state (u) in 
each corresponding genome. Label the internal nodes of T in order to min- 
imize the total cost needed to derive the leaves' states from their common 
ancestor. 

Below is a linear time algorithm for solving this problem. The solution 
for this problem mimics Sankoff s dynamic programming algorithm for 
the weighted small parsimony problem. The optimality proof is perfectly 
analogous to Sankoff s proof (Sankoff, 1975). 

• Input: state(u) for each leaf node of tree. 

• Output: state(u) for each internal node, along with the corresponding 
cost. 

• Objective function: St(v) = minimum score of the subtree rooted at 
vertex v if v has state t. 

• Initialization: for each leaf node /: St(l) = 0 if the state of leaf node is 

otherwise Stil) = oo. 

• Recursion: The score at each vertex is based on the scores of its 
children: 

^/(parent) = min/{5/(leftchild) + 8ij ^F^(leftbranch)} 

+ miny {5y(rightchild) + Sjj PF(rightbranch)} 

where 8ij = 0 /// =j and 1 otherwise. 

• Termination: mmaSa(voot). 

• Complexity: 0(n-d), where n is the number of leaves and d is the 
number of possible states. 

When all the internal nodes of the tree have already been labeled, the 
cost of half-breakpoint state parsimony of the half-breakpoint u can be 
calculated by summing the cost in all the branches that the state changes. 

Piu,T)= J2 ^/,;^(branchlength) (1) 

branch(/,/) 

2.2.4 Recovering missing adjacencies Since the half-breakpoint state 
parsimony problem can be solved efficiently, the remaining question is to 
recover all missing adjacencies such that ^^^ueG^^^' ^ — ^^^^^ ^^^^ 
all half-breakpoints in G — is minimal. Since we filtered all duplicated 
synteny blocks as well as synteny blocks that do not appear in the 
target assembly, adjacencies in the target genome define a perfect match- 
ing in the graph. The cost of this matching is defined to be the sum of the 
half-breakpoint parsimony cost of all vertices (half-breakpoints). 

For each vertex that is not incident to a red edge in the breakpoint 
graph, the Ragout algorithm finds the cost for the half-breakpoint state 
parsimony problem on each of its possible states. These states are chosen 
from the vertex's adjacent vertices in the breakpoint graph. Since choos- 
ing a state for a particular half-breakpoint corresponds to choosing an 
edge incident to it, for each edge in the breakpoint graph, two such cost 
values are calculated. We assign the weight of each edge as the sum of 
these two values. As some adjacencies in the target genome are already 
known (vertices incident to a red edge), we can also remove those vertices 
from the graph before applying the Blossom algorithm to find the perfect 
matching with minimum cost (in 0(\V l"^) time) and add corresponding 
edges into the final matching afterwards. This matching defines the op- 
timal adjacencies in the target genome. 

2.2.5 Building scaffolds Finally, we order contigs into scaffolds. 
Starting from one contig, we try to extend it in both the forward and 
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backward directions, using the inferred adjacencies from the matching 
above. 

2.3 The choice of minimum synteny block size and 
iterative assembly 

Since synteny block reconstruction is a parameter-dependent procedure, 
and the choice of parameters depends on the 'synteny block scale' 
required in each particular application, an important question to ask is: 
which scale of synteny blocks should one use for reference-assisted as- 
sembly? We argue that using a single scale of synteny blocks is not suf- 
ficient to obtain high-quality scaffolds. If one constructs large-scale 
synteny blocks, then small and fragmented synteny blocks (coming 
from small contigs and micro-rearrangements) are not considered, thus 
leading to gaps in the assembly. On the other hand, if one constructs 
small-scale synteny blocks, then rearrangement analysis becomes more 
difficult, since small-scale blocks are more likely to exhibit structural 
variations and are also more susceptible to be incorrectly identified 
(false synteny blocks). To resolve this dilemma, we use multiple synteny 
block scales in order to build multiple scaffolds and then reconcile these 
scaffolds. 

Consider two scaffolds and ^4,,, constructed from large- and small- 
scales of synteny blocks, respectively. A contig is called strong if it is in the 
scaffold A^^ and called weak if it is in A^^, and not in A^^. Two scaffolds A^^ 
and A^^, are called consistent if every pair of adjacent contigs in A^^ is either 
adjacent or separated by only weak contigs in A^^,. Although the order of 
contigs in A^ is more reliable than in A^^,, there are many small synteny 
blocks that do not reveal in A^ and only appear in the 'finer' scaffold ^„,. 
We therefore rely on the 'skeleton of contigs' in A^^ and insert smaller 
contigs from A^^, if they are consistent (see Fig. 2). 

The merged scaffold is therefore consistent with Ag. We successively 
apply the described procedure to scaffolds in different stages (sorted by 
the scale of synteny blocks) until we reach the stage with the smallest 
scale. 

2.4 Refinement with the assembly graph 

While the iterative procedure attempts to maximize the number of contigs 
used to build the scaffold, there are still some certain types of contigs that 
cannot be utilized in that stage. These contigs include: (i) contigs that are 
shorter than minimum synteny block size that existing synteny block 
tools can detect (several hundred nucleotides), (ii) contigs contained 
only in the target genome and (iii) repetitive contigs. Such fragments 
are not considered in the rearrangement analysis and will therefore not 
appear in the merged scaffolds. To add these fragments, we need to use 
the assembly graph, which has been output by short-read assemblers and 
can also be independently constructed from input contigs. The genome 
traverses the graph with a certain unknown path. However, since initial 
scaffolds are now available, we can use these scaffolds to restore small or 
repetitive fragments. This problem is analogous to the problem of repeat 
resolution in short-read assembly. Instead of paired-read information, we 
have pairs of adjacent contigs built during the rearrangement analysis 
stage. 



(a) 
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Given an assembly graph and a set of merged scaffolds from the pre- 
vious step of the algorithm, for each pair of consecutive contigs from 
these scaffolds we find all possible paths connecting them in the assembly 
graph that do not contain contigs from the scaffold. If there exists only 
one such path, we insert all the intermediate contigs along this path be- 
tween the two contigs (see Fig. 3). This procedure significantly improves 
the number of used contigs in most datasets. 



3 RESULTS 

We benchmarked Ragout against other reference-assisted assem- 
bly tools [Mauve Contig Mover (Rissman et al., 2009), OSLay 
(Richter et aL, 2007), RACA (Kim et aL, 2013)] on one simulated 
and three real bacterial datasets. Since the complete sequences of 
target genomes are available, we can also assess the quality of the 
resulting scaffolds by the number of misordered contigs, scaffold 
gaps and coverage. 

As each output scaffold is an ordered list of contigs, we define 
the number of misordered contigs as the minimum number of 
contigs in the scaffolds whose mapping positions or directions 
are not consistent with the mapping positions and directions of 
the contigs before and after them. Also, we define the number of 
gaps in a scaffold as the number of contig pairs that are adjacent 
in a scaffold, but are separated by some other contigs when we 
map all contigs to the reference genome. The coverage is defined 
as the total number of ahgned bases against the reference, 
divided by the genome size. 

Mauve Contig Mover and OSLay were run with parameters 
recommended for bacterial genomes. For Ragout, we ran three 
iterations with the corresponding minimum synteny block sizes: 
5000, 500, 100, as they are the default scales used in Sibeha 
(Minkin et al, 2013) for bacterial genomes. Since RACA 
works with only one synteny block size, we chose the maximal 
synteny block scale (most reliable). 

3.1 Assembly using one closely related reference 

First, we benchmark these tools on an easy case in which the 
target and reference genomes do not exhibit any structural vari- 
ations. In this situation, one reference is sufficient to obtain the 
correct assembly. This also allows us to compare Ragout with 
MCM and OSLay, which can work only with one reference. 

The dataset consists of two different Escherichia coli strains: 
DHl (NC_0 17625) as the reference and K-12 subs. MG1655 
(NC_000913) as the target. The contigs were assembled with 
SPAdes assembler (Bankevich et al, 2012). The results can be 
seen in Table 1 . Ragout and MCM are able to recover one com- 
plete scaffold, while OSLay outputs eight scaffolds. The quality 
of Ragout's assembly without refinement is quite similar to 
MCM, but with refinement Ragout uses significantly more con- 
tigs and produces fewer gaps in the final scaffolds. 



Fig. 2. Merging two scaffolds A^ and A^y built from two different synteny 
scales into a scaffold M. Yellow rectangles represent weak contigs. (a) A^ 
and A^y are consistent, (b) A^ and are not consistent 



Fig. 3. Refinement with the assembly graph. The procedure fills scaffold 
gaps with small contigs. The big circles illustrate contigs with known 
order, while small ones correspond to contigs that were not considered 
in rearrangement analysis 
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Table 1. Comparison of different tools using one reference 



(a) 



(b) 





Ragout 


MCM 


OSLay 


Scaffolds 


1 (1) 


1 


8 


Coverage 


97.9 (97.6) 


97.6 


96.7 


Ordered contigs 


129 (79) 


77 


80 


Gaps 


52 (71) 


73 


61 


Misordered 


0(0) 


0 


1 



All tools were run with their default parameters. For Ragout, results are given both 
with and without (in brackets) refinement. The total number of the contigs is 156. 
Initial assembly coverage is 98.18%. 



3.2 Assembly of one chromosome using multiple 
references 

Next, we want to address a more problematic case in which 
multiple references are available, but each of these references 
exhibits structural variations comparing to the target genome. 
The dataset contains five different Helicobacter Pylori strains: 
Punol20 (NC_017378), ELS37 {NC_017063), Gambia94/24 
(NC_017371) and G27 (NC_011333) as references and SJM180 
(NC_014560) as the target. The corresponding phylogenetic tree 
is shown on Figure 4A. The dot plots showing the rearrange- 
ments can be seen in Figure 5. Contigs were assembled using 
ABYSS (Simpson et al, 2009). 

First, we run Ragout as well as OSLay and MCM on each of 
the available references separately to illustrate that the 'usual' 
assisted assembly with one reference is insufficient for the current 
case. Every tool produces a certain amount of misordered con- 
tigs, which can be explained by the structural divergence between 
the reference and the target (see Table 2). 

Since OSLay and MCM can only run with one reference, we 
use Ragout and RACA to benchmark different sets of multiple 
references (see Table 3). For RACA, G27 was chosen to be the 
reference and others were treated as outgroups. Both tools have 
misordered contigs when using three or fewer references. Ragout 
is able to infer the correct order of the contigs with the set of four 
references, while RACA still has some misordered contigs. Also, 
Ragout outputs the assembly with better coverage and fewer 
gaps. 



3.3 Multiple chromosome assembly using multiple 
references 

The next dataset was taken from a recent study (Bashir et al, 
2012) in which long PacBio reads were used to obtain the com- 
plete de novo assembly of Vibrio Cholerae HI str. 
(AKGHOlOOOOOl). Here we want to show that with multiple 
related references (even though these references and the target 
genome have structural variations), complete scaffolds with high 
quality can also be obtained from only short-read data. We used 
SPAdes to assemble non-paired Illumina reads with read length 
40 bp. Three references were chosen so that each of them would 
have rearrangements in at least one chromosome compared to 
the target genome: 01 Biovar (AE003852), 01 Inaba 
{C MOO 178 5) and 0395 {C POO 123 5). See Figure 6 for the dot- 
plots. The phylogenetic tree is shown on Figure 4(B). 



Puno 120 





G27 






Gannbia94/24 






SJM180 

ELS37 






01 biovar 

0395 




Fig. 4. Phylogenetic trees, (a) Heclicobacter Pylori with SJM180 as 
target, (b) Vibrio Cholerae with HI as target, (c) Staphylococcus Aureus 
with USA300 as target, (d) Simulated genomes. Solid branches contain all 
types of rearrangements, while dashed branches contain only indels 
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Resulting sca^old 

Fig. 5. (a— d) Dot plots of H Pylori references versus target genomes, (e) 
Dot plot of Ragout's scaffold versus the target genome showing a perfect 
diagonal line for visual verification 



Using three references. Ragout was able to correctly recon- 
struct two scaffolds that correspond to two V. Cholerae chromo- 
somes (see Table 4). Refinement with the assembly graph 
significantly increases the number of utilized contigs. 
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Table 2. Comparison with MCM and OSLay using one H. Pylori refer- 
ence showing misordered contigs 



Reference 


Scaffolds 


Ordered 


Misordered 


Coverage 


Mauve contig mover 










G27 


1 


53 


7 


97.9 


Gambia94/24 


1 


54 


8 


97.9 


ELS37 


1 


45 


9 


98.0 


Punol20 


1 


56 


11 


98.0 


OSLay 










G27 


5 


50 


1 


96.2 


Gambia94/24 


8 


49 


3 


96.4 


ELS37 


6 


53 


3 


98.0 


Punol20 


8 


51 


2 


87.0 


Ragout 










G27 


1 (1) 


91 (50) 


4(4) 


97.7 (97.4) 


Gambia94/24 


2(2) 


83 (45) 


7(7) 


96.8 (96.6) 


ELS37 


1 (1) 


102 (56) 


4(3) 


98.1 (97.5) 


Punol20 


2(2) 


92 (49) 


6(6) 


97.2 (96.9) 



All tools were run with their default parameters. For Ragout, results are given both 
with and without (in brackets) refinement. The total number of contigs is 183. Initial 
assembly coverage is 98.57%. 
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Fig. 6. Dot plots of different chromosomes of V.Cholerae references (a-c) 
versus the corresponding chromosomes of the target genome showing 
rearrangements 



Table 3. Comparison of Ragout and RACA on H.pylori using multiple 
references 

Table 4. Comparison of Ragout and RACA on V.cholerae using multiple 

references 

References Scaffolds Coverage Ordered Gaps Misordered 



Ragout 

G27, ELS37 

G27, Punol20, ELS37 

G27, Punol20, 

Gambia94/24, ELS37 
RACA 

G27, ELS37 

G27, Punol20, ELS37 

G27, Punol20, 

Gambia94/24, ELS37 



References 



Scaffolds Coverage Ordered Gaps Misordered 



2 (2) 97.8 (97.6) 95 (53) 22 (39) 1 (1) 
1 (1) 97.8 (97.6) 95 (53) 21 (36) 1 (1) 
1 (1) 97.6 (97.3) 93 (46) 22 (38) 0 (0) 



3 83.6 35 29 2 

2 83.6 35 30 1 

2 83.8 35 31 1 



All tools were run with their default parameters. For Ragout, results are given both 
with and without (in brackets) refinement. The total number of contigs is 183. Initial 
assembly coverage is 98.57%. 



For RACA, 01 Inaba was chosen to be the reference and 
others were treated as outgroups. RACA outputs three correct 
scaffolds with three references and six scaffolds with two refer- 
ences. Ragout outperforms RACA by the quality of the assem- 
bly, both with and without refinement. 

3.4 Assembly using structurally divergent references 

Finally, we want to address the case when the references have a 
large number of structural variations. This case requires us to 
perform simulations because bacterial genomes usually have few 
rearrangements. 

The phylogenetic tree that was used for simulations is shown 
on Figure 4(D). For the sake of simplicity, the tree is chosen to 
be symmetrical and can be treated both as an unrooted tree or as 
a rooted one with the target branch of infinitesimal length. On 
each of the four outer branches (incident to references), five re- 
versals and five translocations were simulated. Additionally, to 



Ragout 
Ol Inaba 

Ol Inaba, Ol biovar 
Ol Inaba, Ol 

biovar, 0395 
RACA 

Ol Inaba, Ol biovar 
Ol Inaba, Ol 

biovar, 0395 



3(3) 
2(2) 
2(2) 



95.3 (94.8) 317 (185) 41 (64) 5 (3) 
95.5 (94.7) 305 (179) 46 (68) 6 (4) 
95.5 (94.7) 300 (174) 46 (66) 2 (0) 



85.8 
90.0 



124 
127 



51 

56 



All tools were run with their default parameters. For Ragout, results are given both 
with and without (in brackets) refinement. The total number of contigs is 1407. 
Initial assembly coverage is 96.89%. 



make the dataset more complex, we have simulated 10 indels on 
each of the tree branches, including all inner ones. Genomes were 
then decomposed into synteny blocks, and then the target 
genome was split into contigs, where each contig represents 
exactly one synteny block. 

Our analysis includes three cases corresponding to when four, 
three or two of the simulated references are available. For each 
case, we have generated 100 different datasets. Since the phylo- 
genetic tree is symmetric, the result does not depend on which 
particular reference is absent. We took E.coli K-12 str. MG1655 
substr. (NC_000913) as a target and the number of synteny 
blocks in decomposition was 112 in average. 

Next, we run Ragout on every dataset. The results of simula- 
tions can be seen in Figure 7, which clearly shows that the 
number of misordered contigs increases when some of the refer- 
ences are missing from the analysis. The errors in the case when 
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Table 5. Comparison of Ragout and RACA on simulated datasets 



Number of references 



Fig. 7. Correspondence of the number of available references with the 
number of misordered contigs for Ragout 





Dataset 1 




Dataset 2 




Ragout 


RACA 


Ragout 


RACA 


Scaffolds 


1 


6 


1 


8 


Coverage 


95.3 


83.5 


94.5 


75.4 


Ordered 


112 


101 


112 


90 


Gaps 


3 


6 


11 


5 


Misordered 


0 


3 


4 


2 



Ragout and RACA were run with synteny block size equal to 500 (the size is known 
from the simulations) without refinement. The total number of contigs is 1 14 in both 
cases. Initial assembly coverage is 100%. 



all references are available can be explained by breakpoint reuse, 
which makes it impossible for the algorithm to distinguish over- 
lapping rearrangements. 

To compare Ragout and RACA, we chose two datasets with 
four references, since each run of RACA requires a large number 
of manual preparations. Results can be seen in Table 5. In both 
cases. Ragout produces better assemblies than RACA; one pos- 
sible explanation for this phenomenon is that RACA heavily 
relies on one particular reference. 

3.5 Parameter choice and iterative assembly 

We would like to benchmark Ragout with different parameters 
of iterative/non-iterative assembly. The dataset consists of five 
strains of Staphylococcus Aureus as references: COL 
(NC_002951), JKD6008 (NC_0 17341), N315 (NC_002745), 
RF112 (NC_007622) and USA300 (NC_007793) as the target. 
The phylogenetic tree is shown in Figure 4(C). Contigs were 
assembled from single-cell sequencing data using SPAdes. 

The results of benchmarking can be seen in Table 6. As ex- 
pected, the smaller size of a synteny block allows the algorithm 
to arrange more contigs, but analysis becomes more complicated. 
As a result, the algorithm produces some incorrect adjacencies 
which leads to misordered contigs and more fragmented assem- 
bly (in number of scaffolds). Next, iterative assembly was per- 
formed in three stages (5000, 500, 100). This assembly kept the 
complete scaffold from the first stage, while adding some smaller 
contigs from the second and third stages. Even though some 
misordered contigs could be carried to the merged scaffolds, 
these errors are local and do not violate the correct 'skeleton' 
of scaffolds. 

4 DISCUSSION 

In this article, we have presented Ragout, a package for improv- 
ing assemblies using multiple complete references. We demon- 
strated that with multiple related genomes available, one can 
obtain a complete and high-quality scaffold for each chromo- 
some using only high-throughput short-read sequencing. This 
marks an important improvement in genome assembly of short 
reads and even raises a question whether long PacBio reads or 
long jumping libraries are needed for genomic studies of 



Table 6. Iterative assembly of S Aureus 



SB size 


100 


500 


5000 


Iterative 


Scaffolds 


5 


2 


1 


1 


Coverage 


96.7 


96.3 


92.0 


96.7 


Ordered 


108 


91 


62 


89 


Gaps 


75 


75 


56 


73 


Misordered 


1 


0 


0 


1 



Ragout was run with four S.Aureus references with different minimum synteny 
block sizes. Iterative assembly was performed with all previous sizes combined 
(5000, 500, 100). The total number of contigs is 767. Initial assembly coverage is 
98.4% . We did not perform refinement with the assembly graph in order to focus on 
the effect of synteny block size. 



common bacteria where multiple related references have been 
available. 

The current version of Ragout uses Sibelia for synteny block 
reconstruction and therefore limits itself to bacterial genomes. 
When synteny blocks have been available. Ragout is fast and 
memory-efficient. We plan to make Ragout compatible with 
other synteny block generation tools that can work with mam- 
mahan genomes [e.g. Cactus aligner Paten et al. (2011)] and fur- 
ther extend Ragout to work with mammahan datasets. Another 
limitation of Ragout is that it only uses the assembly graph for 
recovering repetitive blocks or small contigs that could not be 
captured in synteny analysis. Therefore, it can make mistakes 
when rearrangements happened on the target branch. Since de 
Bruijn graphs can be transformed into breakpoint graphs and 
vice-versa, the de Bruijn graphs output from short-read assem- 
blers can also be used for rearrangement analysis and we will 
focus on this issue in further studies. 
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